Adaptive Iterative Soft-Thresholding Algorithm with the Median Absolute Deviation

Feng, Yining, Selesnick, Ivan

arXiv.org Machine Learning

Abstract--The adaptive Iterative Soft-Thresholding Algorithm (ISTA) has been a popular algorithm for finding a desirable solution to the LASSO problem without explicitly tuning the regularization parameter λ. Although the adaptive ISTA is a successful practical algorithm, few theoretical results exist. In this paper, we present a theoretical analysis of the adaptive ISTA with the thresholding strategy of estimating the noise level by the median absolute deviation. We show properties of the fixed points of the algorithm, including scale equivariance, non-uniqueness, and local stability; prove a local linear convergence guarantee; and characterize its global convergence behavior. Many sparse approximation problems in machine learning and signal processing can be obtained as the solution to the LASSO problem, which can be solved by ISTA. Despite its popularity, tuning λ remains a practical difficulty. The obtained LASSO solution is optimal in the mean-squared-error (MSE) sense with minimal assumptions, but LARS is not competitive in terms of computation time for large-scale problems [7].
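A minimal numpy sketch of the idea the abstract describes: ISTA where the threshold is re-estimated at each iteration from the residual's median absolute deviation rather than fixed by a tuned λ. The particular threshold rule `t = k * step * sigma_hat` with `k = 3` is an illustrative assumption here, not the paper's exact thresholding strategy.

```python
import numpy as np

def mad_sigma(r):
    """Estimate the noise standard deviation via the median absolute deviation."""
    return np.median(np.abs(r - np.median(r))) / 0.6745

def soft(x, t):
    """Soft-thresholding operator."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def adaptive_ista(A, y, n_iter=300, step=None, k=3.0):
    """Adaptive ISTA sketch: the threshold is re-estimated every iteration
    from the residual's MAD instead of being fixed by a tuned lambda.
    The rule t = k * step * sigma_hat is a hypothetical choice."""
    if step is None:
        step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L with L = ||A||_2^2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        r = y - A @ x                    # current residual
        g = x + step * (A.T @ r)         # gradient step on the data-fit term
        t = k * step * mad_sigma(r)      # MAD-based threshold (assumed rule)
        x = soft(g, t)
    return x
```

As the iterates converge, the residual shrinks toward the noise floor, so the MAD estimate (and hence the effective regularization) adapts automatically.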


Proving the Lottery Ticket Hypothesis: Pruning is All You Need

Malach, Eran, Yehudai, Gilad, Shalev-Shwartz, Shai, Shamir, Ohad

arXiv.org Machine Learning

Neural network pruning is a popular method to reduce the size of a trained model, allowing efficient computation during inference time, with minimal loss in accuracy. However, such a method still requires the process of training an over-parameterized network, as training a pruned network from scratch seems to fail (see [10]). Recently, a work by Frankle and Carbin [10] has presented a surprising phenomenon: pruned neural networks can be trained to achieve good performance, when resetting their weights to their initial values. Hence, the authors state the lottery ticket hypothesis: a randomly-initialized neural network contains a subnetwork that, when trained in isolation, can match the performance of the original network. This observation has attracted great interest, with various followup works trying to understand this intriguing phenomenon. Specifically, very recent works by Zhou et al. [37], Ramanujan et al. [27] presented algorithms to find subnetworks that already achieve good performance, without any training.
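The reset-to-init step at the heart of the lottery ticket procedure can be sketched as follows. This is a one-shot, global magnitude-pruning toy, not Frankle and Carbin's exact iterative, per-layer algorithm; the function name and `keep_frac` parameter are illustrative.

```python
import numpy as np

def lottery_ticket_reset(w_init, w_trained, keep_frac=0.2):
    """One round of magnitude pruning with reset-to-init: keep the
    largest-magnitude trained weights, then rewind the surviving weights
    to their initial values so the subnetwork can be retrained from init."""
    flat = np.abs(w_trained).ravel()
    k = max(1, int(keep_frac * flat.size))
    thresh = np.partition(flat, -k)[-k]            # k-th largest magnitude
    mask = (np.abs(w_trained) >= thresh).astype(w_init.dtype)
    return mask, mask * w_init                     # subnetwork restarts from init
```

In the full procedure this prune-and-rewind step is applied repeatedly, shrinking `keep_frac` over several rounds of training.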


Minimax Nonparametric Two-sample Test

Xing, Xin, Shang, Zuofeng, Du, Pang, Ma, Ping, Zhong, Wenxuan, Liu, Jun S.

arXiv.org Machine Learning

We consider the problem of comparing probability densities between two groups. To model the complex pattern of the underlying densities, we formulate the problem as a nonparametric density hypothesis testing problem. The major difficulty is that conventional tests may fail to distinguish the alternative from the null hypothesis while controlling the type I error. In this paper, we model log-transformed densities in a tensor product reproducing kernel Hilbert space (RKHS) and propose a probabilistic decomposition of this space. Under such a decomposition, we quantify the difference of the densities between two groups by the component norm in the probabilistic decomposition. Based on the Bernstein width, a sharp minimax lower bound on the distinguishable rate is established for the nonparametric two-sample test. We then propose a penalized likelihood ratio (PLR) test possessing the Wilks' phenomenon with an asymptotically Chi-square distributed test statistic and achieving the established minimax testing rate. Simulations and real applications demonstrate that the proposed test outperforms the conventional approaches under various scenarios.
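The "Wilks' phenomenon" the abstract invokes is the classical fact that a likelihood-ratio statistic is asymptotically chi-square under the null. A minimal parametric toy illustrates the mechanic; this is a Gaussian two-sample likelihood-ratio test, not the paper's nonparametric penalized likelihood ratio test in a tensor product RKHS. Note that for 2 degrees of freedom the chi-square survival function is simply exp(-t/2), so no special-function library is needed.

```python
import numpy as np

def gaussian_loglik(x):
    """Maximized log-likelihood of a sample under a fitted Gaussian (MLE mean/var)."""
    var = x.var()
    return -0.5 * x.size * (np.log(2.0 * np.pi * var) + 1.0)

def lr_two_sample_test(x, y):
    """Parametric likelihood-ratio two-sample test (Gaussian toy model).
    Wilks: under H0 (both groups share one density) the statistic is
    asymptotically chi-square with 2 degrees of freedom (extra mean + var),
    whose survival function is exp(-t/2)."""
    stat = 2.0 * (gaussian_loglik(x) + gaussian_loglik(y)
                  - gaussian_loglik(np.concatenate([x, y])))
    pval = np.exp(-stat / 2.0)   # chi2(2) survival function
    return stat, pval
```

The paper's contribution is to make this calibration work nonparametrically, with the test achieving the minimax distinguishable rate.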


Neural tangent kernels, transportation mappings, and universal approximation

Ji, Ziwei, Telgarsky, Matus, Xian, Ruicheng

arXiv.org Machine Learning

This paper establishes rates of universal approximation for the shallow neural tangent kernel (NTK): network weights are only allowed microscopic changes from random initialization, which entails that activations are mostly unchanged, and the network is nearly equivalent to its linearization. Concretely, the paper has two main contributions: a generic scheme to approximate functions with the NTK by sampling from transport mappings between the initial weights and their desired values, and the construction of transport mappings via Fourier transforms. Regarding the first contribution, the proof scheme provides another perspective on how the NTK regime arises from rescaling: redundancy in the weights due to resampling allows individual weights to be scaled down. Regarding the second contribution, the most notable transport mapping asserts that roughly $1 / \delta^{10d}$ nodes are sufficient to approximate continuous functions, where $\delta$ depends on the continuity properties of the target function. By contrast, nearly the same proof yields a bound of $1 / \delta^{2d}$ for shallow ReLU networks; this gap suggests a tantalizing direction for future work, separating shallow ReLU networks and their linearization.
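The claim that microscopic weight changes leave the network "nearly equivalent to its linearization" can be checked numerically. The sketch below compares a shallow ReLU network against its first-order Taylor expansion around the initialization; the architecture and scaling choices are illustrative, not the paper's exact setup.

```python
import numpy as np

def shallow_relu(x, W, a):
    """Shallow network f(x) = sum_j a_j * relu(w_j . x)."""
    return a @ np.maximum(W @ x, 0.0)

def linearized(x, W0, a0, W, a):
    """First-order Taylor expansion of f(x; W, a) around (W0, a0):
    the NTK regime treats the network as this linear model in the weights."""
    h = W0 @ x
    act = (h > 0).astype(float)              # relu'(h) at initialization
    f0 = a0 @ np.maximum(h, 0.0)
    df_a = (a - a0) @ np.maximum(h, 0.0)     # df/da_j = relu(w0_j . x)
    df_W = (a0 * act) @ ((W - W0) @ x)       # df/dW_j = a0_j * 1[h_j > 0] * x
    return f0 + df_a + df_W
```

For weight perturbations of size eps, the linearization error is second order (roughly eps^2, up to rare activation-pattern flips near the ReLU kinks), while the first-order change in the output is of size eps; this separation of scales is what makes the NTK regime tractable.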